Introduction


In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

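Then, read the (sample) input table. The same CSV file is loaded twice, as A and B, so that each exploration tool below works on its own copy.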


In [2]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

# Get the path of the input table
path = datasets_dir + os.sep + 'dblp_demo.csv'

In [3]:
# Read the CSV file and set 'id' as the key attribute
A = em.read_csv_metadata(path, key='id')
B = em.read_csv_metadata(path, key='id')
A.head()


Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
Out[3]:
  | id | title | authors | venue | year
0 | l0 | Paradise: A Database System for GIS Applications | Paradise Team | SIGMOD Conference | 1995
1 | l1 | A Query Language and Optimization Techniques for Unstructured Data | Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu | SIGMOD Conference | 1996
2 | l2 | Turbo-charging Vertical Mining of Large Databases | Jayant R. Haritsa, Devavrat Shah, S. Sudarshan, Pradeep Shenoy, Mayank Bawa, Gaurav Bhalotia | SIGMOD Conference | 2000
3 | l3 | Maintenance of Data Cubes and Summary Tables in a Warehouse | Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick | SIGMOD Conference | 1997
4 | l4 | On Relational Support for XML Publishing: Beyond Sorting and Tagging | Raghav Kaushik, Jeffrey F. Naughton, Surajit Chaudhuri | SIGMOD Conference | 2003
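
Before relying on 'id' as the key, it can help to confirm that the column really behaves like one. The sketch below uses plain pandas and assumes a key column must be present, free of missing values, and unique. The helper name check_key and the miniature sample table are hypothetical and are not part of py_entitymatching.

```python
import pandas as pd

def check_key(df, key):
    """Return a list of problems that would make `key` unusable as a key column."""
    if key not in df.columns:
        return ["column %r not found" % key]
    problems = []
    if df[key].isnull().any():
        problems.append("missing values in %r" % key)
    if df[key].duplicated().any():
        problems.append("duplicate values in %r" % key)
    return problems

# Hypothetical miniature of the table shown above
sample = pd.DataFrame({
    "id": ["l0", "l1", "l2"],
    "title": [
        "Paradise: A Database System for GIS Applications",
        "A Query Language and Optimization Techniques for Unstructured Data",
        "Turbo-charging Vertical Mining of Large Databases",
    ],
    "year": [1995, 1996, 2000],
})

print(check_key(sample, "id"))  # prints []
```

An empty list means the column passed all three checks. Running the same helper again after an editing session shows whether the edits broke the key.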

Data Exploration

This notebook demonstrates two data exploration tools, OpenRefine and pandastable. OpenRefine is supported on Python 2.7 and 3.5; pandastable is supported only on Python 3.5.

OpenRefine


In [4]:
# Invoke the OpenRefine GUI for data exploration
p = em.data_explore_openrefine(A, name='Table')

In [5]:
# Save the project back to our dataframe.
# After export_pandas_frame is called, the OpenRefine project is deleted automatically.
A = p.export_pandas_frame()

In [6]:
A.head()


Out[6]:
  | id | title | authors | venue | year
0 | l0 | You can modify data if necessary using OpenRefine | Paradise Team | SIGMOD Conference | 1995
1 | l1 | A Query Language and Optimization Techniques for Unstructured Data | Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu | SIGMOD Conference | 1996
2 | l2 | Turbo-charging Vertical Mining of Large Databases | Jayant R. Haritsa, Devavrat Shah, S. Sudarshan, Pradeep Shenoy, Mayank Bawa, Gaurav Bhalotia | SIGMOD Conference | 2000
3 | l3 | Maintenance of Data Cubes and Summary Tables in a Warehouse | Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick | SIGMOD Conference | 1997
4 | l4 | On Relational Support for XML Publishing: Beyond Sorting and Tagging | Raghav Kaushik, Jeffrey F. Naughton, Surajit Chaudhuri | SIGMOD Conference | 2003
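
After an editing session it is often useful to see exactly which cells changed. The following is a minimal sketch in plain pandas, assuming both frames share a unique key column. The helper changed_cells and the two small frames are hypothetical stand-ins for the table before and after export.

```python
import pandas as pd

def changed_cells(before, after, key):
    """List (key value, column, old, new) for every cell that differs."""
    b = before.set_index(key)
    a = after.set_index(key)
    # Compare only the rows and columns present in both frames
    rows = b.index.intersection(a.index)
    cols = b.columns.intersection(a.columns)
    b, a = b.loc[rows, cols], a.loc[rows, cols]
    # A cell counts as changed unless both values are missing
    differs = (b != a) & ~(b.isnull() & a.isnull())
    return [(r, c, b.at[r, c], a.at[r, c])
            for r in rows for c in cols if differs.at[r, c]]

# Hypothetical before/after frames mirroring the edit shown above
before = pd.DataFrame({
    "id": ["l0", "l1"],
    "title": [
        "Paradise: A Database System for GIS Applications",
        "A Query Language and Optimization Techniques for Unstructured Data",
    ],
    "year": [1995, 1996],
})
after = before.copy()
after.loc[0, "title"] = "You can modify data if necessary using OpenRefine"

changes = changed_cells(before, after, "id")
for key_value, column, old, new in changes:
    print(key_value, column, repr(old), "->", repr(new))
```

Note that this comparison is strict: a value whose type changed during the round trip (for example a number that came back as a string) would also be reported as a difference.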

Pandastable


In [7]:
# Invoke the pandastable GUI for data exploration
# This call blocks until the GUI is closed
em.data_explore_pandastable(B)

In [8]:
B.head()


Out[8]:
  | id | title | authors | venue | year
0 | l0 | You can modify data if necessary using pandastable | Paradise Team | SIGMOD Conference | 1995
1 | l1 | A Query Language and Optimization Techniques for Unstructured Data | Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu | SIGMOD Conference | 1996
2 | l2 | Turbo-charging Vertical Mining of Large Databases | Jayant R. Haritsa, Devavrat Shah, S. Sudarshan, Pradeep Shenoy, Mayank Bawa, Gaurav Bhalotia | SIGMOD Conference | 2000
3 | l3 | Maintenance of Data Cubes and Summary Tables in a Warehouse | Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick | SIGMOD Conference | 1997
4 | l4 | On Relational Support for XML Publishing: Beyond Sorting and Tagging | Raghav Kaushik, Jeffrey F. Naughton, Surajit Chaudhuri | SIGMOD Conference | 2003
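
In the cells above, B shows the edited title even though the result of the call was never assigned back, which suggests the edits are applied to the dataframe that was passed in. If the original values may be needed later, a copy taken beforehand is a cheap safeguard. The sketch below illustrates the idea with plain pandas; edit_in_place is a hypothetical stand-in for a GUI session and is not the pandastable API.

```python
import pandas as pd

def edit_in_place(df):
    """Hypothetical stand-in for a GUI session that edits the frame it is given."""
    df.loc[0, "title"] = "You can modify data if necessary using pandastable"

table = pd.DataFrame({
    "id": ["l0", "l1"],
    "title": [
        "Paradise: A Database System for GIS Applications",
        "A Query Language and Optimization Techniques for Unstructured Data",
    ],
})

snapshot = table.copy()   # DataFrame.copy() makes a deep copy by default
edit_in_place(table)      # modifies `table` itself, not a copy

print(table.loc[0, "title"])     # the edited title
print(snapshot.loc[0, "title"])  # the original title, still intact
```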
